On the efficient execution of bounded Jaro-Winkler distances
Abstract
Over the last few years, time-efficient approaches for the discovery of links between knowledge bases have been regarded as a key requirement for implementing the idea of a Data Web. Efficient and effective measures for comparing the labels of resources are thus central to discovering links between datasets on the Web of Data, as well as to their integration and fusion. We present a novel time-efficient implementation of filters that allow for the efficient execution of bounded Jaro-Winkler measures. We evaluate our approach on several datasets derived from DBpedia 3.9 and LinkedGeoData and containing up to 10 strings, and show that it scales linearly with the size of the data for large thresholds. Moreover, we show that our approach can be easily implemented in parallel. We also evaluate our approach against SILK and show that we outperform it even on small datasets.
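To make the idea of a bounded Jaro-Winkler computation concrete, the following is a minimal sketch in Python. It is not the implementation described in the abstract. It assumes the standard Jaro-Winkler definition (prefix scale 0.1, prefix length capped at 4) and uses one simple illustrative filter: an upper bound on the score derived from the two string lengths alone. The function names `jaro`, `jaro_winkler`, `length_upper_bound` and `bounded_pairs` are hypothetical and chosen only for this example.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both match sequences in order and count positions that disagree.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2  # transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)


def length_upper_bound(len1: int, len2: int, p: float = 0.1, max_prefix: int = 4) -> float:
    """Upper bound on Jaro-Winkler from lengths only (illustrative filter).

    At most min(len1, len2) characters can match, so the Jaro score is at
    most (2 + shorter / longer) / 3; the prefix bonus is then maximised.
    """
    if len1 == 0 or len2 == 0:
        return 1.0 if len1 == len2 else 0.0
    shorter, longer = sorted((len1, len2))
    jaro_max = (2.0 + shorter / longer) / 3.0
    return jaro_max + max_prefix * p * (1.0 - jaro_max)


def bounded_pairs(sources, targets, threshold):
    """Return (s, t, score) for all pairs with score >= threshold."""
    results = []
    for s in sources:
        for t in targets:
            # Skip the full computation when the bound rules the pair out.
            if length_upper_bound(len(s), len(t)) < threshold:
                continue
            score = jaro_winkler(s, t)
            if score >= threshold:
                results.append((s, t, score))
    return results
```

Because the bound never underestimates the true score, discarding a pair whose bound falls below the threshold cannot lose a valid result. The higher the threshold, the more pairs such a filter rules out before any character comparison takes place.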
Similar resources
Time-efficient execution of bounded Jaro-Winkler distances
Over the last few years, time-efficient approaches for the discovery of links between knowledge bases have been regarded as a key requirement for implementing the idea of a Data Web. A considerable portion of the information available as RDF on the Web pertains to persons. Thus, efficient and effective measures for comparing names are central to facilitate the integration of informati...
On Flexible Web Services Composition Networks
The semantic Web service community is working to bring semantics to Web service descriptions and to allow automatic discovery and composition. However, there is no widespread adoption of such descriptions yet, because semantically defining Web services is highly complicated and costly. As a result, production Web services still rely on syntactic descriptions, keyword-based discovery and pre...
New robust and secure alphabet pairing Text Steganography Algorithm
Steganography has been practiced since ancient times. Many linguistic steganography (popularly known as text-based steganography) algorithms have been proposed, such as Word Spacing, Substitution, Adjectives, Text Rotation, Mixed Case Font, etc. Information hiding effectively means that the method or technique should be robust and secure and have good embedding capacity. Measure of Similarity between cov...
Evaluating String Comparator Performance for Record Linkage
We compare variations of string comparators based on the Jaro-Winkler comparator and edit distance comparator. We apply the comparators to Census data to see which are better classifiers for matches and nonmatches, first by comparing their classification abilities using a ROC curve based analysis, then by considering a direct comparison between two candidate comparators in record linkage results.
Name Phylogeny: A Generative Model of String Variation
Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were inst...
Journal title:
- Semantic Web
Volume 8, Issue
Pages -
Publication date 2017